Pd support #94

Merged
mayabar merged 7 commits into llm-d:main on Jul 15, 2025

Conversation

@mayabar (Collaborator) commented Jul 14, 2025

No description provided.

mayabar added 3 commits July 14, 2025 16:08
…ecode fields

Signed-off-by: Maya Barnea <mayab@il.ibm.com>
Update config_test to use a function to create configuration same as defined in the config yaml file

Signed-off-by: Maya Barnea <mayab@il.ibm.com>
change command line argument name to 'kv-cache-transfer-latency'

Signed-off-by: Maya Barnea <mayab@il.ibm.com>
@mayabar requested review from shmuelk and irar2 on July 14, 2025 13:12

```go
)

DescribeTable("invalid configurations",
	func(args []string) {
		_, err := createSimConfig(args)
		Expect(err).To(HaveOccurred())
	},
	Entry(tests[7].name, tests[7].args),
```

Collaborator:
You removed a test here instead of only increasing the indices

Collaborator Author:
added test 13
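
For reference, re-adding the case in the Ginkgo `DescribeTable` amounts to one more `Entry` line; a minimal sketch, assuming the new case simply became index 13 of the existing `tests` slice:

```go
DescribeTable("invalid configurations",
	func(args []string) {
		_, err := createSimConfig(args)
		Expect(err).To(HaveOccurred())
	},
	Entry(tests[7].name, tests[7].args),
	// ... entries 8-12 unchanged ...
	Entry(tests[13].name, tests[13].args), // re-added case, index taken from "added test 13"
)
```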


```go
	return c
}
```

Collaborator:
I think this is confusing, and one function is enough. There is only one case for createDefaultBasicConfig, and we can just update the lora parameters in the test.

Collaborator Author:
basicConfig function was removed

```go
// isDoRemoteDecode() returns true is do_remote_decode is true in the request, this means that this is prefill request
doRemoteDecode() bool
// isDoRemotePrefill() returns true is do_remote_prefill is true in the request, this means that this is decode request
doRemotePrefill() bool
```

Collaborator:
The names in the comments don't match the actual names

Collaborator Author:
fixed
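
The fix presumably aligns the comments with the declared method names; a sketch of the corrected doc comments (the enclosing interface name is assumed, only the method names and their semantics come from the diff above):

```go
// completionRequest is an assumed interface name used only for this sketch.
type completionRequest interface {
	// doRemoteDecode returns true if do_remote_decode is true in the request,
	// which means this is a prefill request
	doRemoteDecode() bool
	// doRemotePrefill returns true if do_remote_prefill is true in the request,
	// which means this is a decode request
	doRemotePrefill() bool
}
```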

```go
RemoteBlockIds []string `json:"remote_block_ids"`
RemoteEngineId string   `json:"remote_engine_id"`
RemoteHost     string   `json:"remote_host"`
RemotePort     int      `json:"remote_port"`
```

Collaborator:
Consider adding comments for the fields

Collaborator Author:
added
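
The added field comments are not shown in the thread; a sketch of what they might look like (only the field names and JSON tags are taken from the diff, the struct name and comment wording are assumptions):

```go
// remoteKVInfo is an assumed struct name for this sketch.
type remoteKVInfo struct {
	RemoteBlockIds []string `json:"remote_block_ids"` // KV-cache block IDs on the remote instance
	RemoteEngineId string   `json:"remote_engine_id"` // ID of the remote engine participating in the P/D transfer
	RemoteHost     string   `json:"remote_host"`      // host of the remote instance
	RemotePort     int      `json:"remote_port"`      // port of the remote instance
}
```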

```go
RemoteBlockIds []string `json:"remote_block_ids"`
RemoteEngineId string   `json:"remote_engine_id"`
RemoteHost     string   `json:"remote_host"`
RemotePort     int      `json:"remote_port"`
```

Collaborator:
Likewise

Collaborator Author:
added

```
@@ -304,6 +306,8 @@ func (s *VllmSimulator) readRequest(ctx *fasthttp.RequestCtx, isChatCompletion b
	var req textCompletionRequest

	err := json.Unmarshal(ctx.Request.Body(), &req)

	fmt.Printf("Unmarshaled text request: %#v\n", req)
	return &req, err
```

Collaborator:
Debug printing?
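
The simplest resolution (the thread does not show what was actually merged) is to drop the leftover print before returning:

```go
	var req textCompletionRequest

	err := json.Unmarshal(ctx.Request.Body(), &req)
	// leftover debug print removed; the unmarshal error is returned to the caller
	return &req, err
```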

```diff
@@ -477,16 +491,23 @@ func (s *VllmSimulator) reqProcessingWorker(ctx context.Context, id int) {
 			isChatCompletion: reqCtx.isChatCompletion,
 			model:            displayModel,
 		},
-		responseTokens, toolCalls, finishReason, usageDataToSend,
+		responseTokens, toolCalls, finishReason, usageDataToSend, req.doRemotePrefill(),
```

Collaborator:
Maybe add doRemotePrefill to streamingContext?

Collaborator Author:
good idea, added
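
A sketch of the resulting struct; `isChatCompletion` and `model` are taken from the diff context above, the rest of the real `streamingContext` is not shown in the thread:

```go
// sketch only: the actual streamingContext in the simulator has additional fields
type streamingContext struct {
	isChatCompletion bool
	model            string
	doRemotePrefill  bool // carried here so the streaming path can pick the first-token delay
}
```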

```
@@ -638,6 +671,14 @@ func (s *VllmSimulator) sendResponse(isChatCompletion bool, ctx *fasthttp.Reques
	s.responseSentCallback(modelName)
}

// returns time to first token based on whether
```

Collaborator:
on whether what? :)

Collaborator Author:
fixed
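
The fixed comment presumably documents that the first-token delay depends on whether the request does remote prefill; a minimal sketch of such a helper, with the function name, signature, and parameter names assumed rather than taken from the merged code (needs `import "time"`):

```go
// getTimeToFirstToken returns the delay before the first token: the KV-cache
// transfer latency when the request does remote prefill, otherwise the regular
// time-to-first-token. Names and signature are assumptions for this sketch.
func getTimeToFirstToken(timeToFirstToken, kvCacheTransferLatency time.Duration, doRemotePrefill bool) time.Duration {
	if doRemotePrefill {
		return kvCacheTransferLatency
	}
	return timeToFirstToken
}
```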

Signed-off-by: Maya Barnea <mayab@il.ibm.com>
@mayabar requested a review from irar2 on July 15, 2025 06:17
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
irar2 previously approved these changes Jul 15, 2025
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
irar2 previously approved these changes Jul 15, 2025
README.md Outdated

```diff
@@ -29,9 +29,9 @@ The simulator supports two modes of operation:
 - `echo` mode: the response contains the same text that was received in the request. For `/v1/chat/completions` the last message for the role=`user` is used.
 - `random` mode: the response is randomly chosen from a set of pre-defined sentences.
 
-Timing of the response is defined by two parameters: `time-to-first-token` and `inter-token-latency`.
+Timing of the response is defined by `time-to-first-token` and `inter-token-latency` parameters. In case P/D is enabled for a request, `kv-cache-transfer-latency` will be used instead of `time-to-first-token`.
```

Collaborator:
replace: is defined by `time-to-first-token

with: is defined by the time-to-first-token

Collaborator Author:
done

README.md Outdated

```diff
-For a request with `stream=true`: `time-to-first-token` defines the delay before the first token is returned, `inter-token-latency` defines the delay between subsequent tokens in the stream.
+For a request with `stream=true`: `time-to-first-token` or `kv-cache-transfer-latency` defines the delay before the first token is returned, `inter-token-latency` defines the delay between subsequent tokens in the stream.
 
 For a request with `stream=false`: the response is returned after delay of `<time-to-first-token> + (<inter-token-latency> * (<number_of_output_tokens> - 1))`
```

Collaborator:
Why wasn't kv-cache-transfer-latency mentioned here?
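
One way the non-streaming formula could account for P/D (a suggestion, not the merged README wording): when P/D is enabled for the request, `<kv-cache-transfer-latency>` takes the place of `<time-to-first-token>` in the same expression. A self-contained sketch of that computation:

```go
package main

import (
	"fmt"
	"time"
)

// totalResponseDelay mirrors the documented formula: the first-token delay plus
// inter-token latency for every remaining output token. When P/D is enabled,
// the KV-cache transfer latency is used as the first-token delay.
func totalResponseDelay(ttft, interToken, kvTransfer time.Duration, pdEnabled bool, numOutputTokens int) time.Duration {
	first := ttft
	if pdEnabled {
		first = kvTransfer
	}
	return first + interToken*time.Duration(numOutputTokens-1)
}

func main() {
	// example values only
	fmt.Println(totalResponseDelay(2*time.Second, 100*time.Millisecond, 500*time.Millisecond, true, 10))
}
```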

```diff
@@ -616,8 +649,8 @@ func (s *VllmSimulator) createCompletionResponse(isChatCompletion bool, respToke
 // finishReason - a pointer to string that represents finish reason, can be nil, stop, length, or tools
 // usageData - usage (tokens statistics) for this response
 func (s *VllmSimulator) sendResponse(isChatCompletion bool, ctx *fasthttp.RequestCtx, respTokens []string, toolCalls []toolCall,
-	modelName string, finishReason string, usageData *usage) {
+	modelName string, finishReason string, usageData *usage, doRemoteDecode bool, doRemotePrefill bool) {
 	resp := s.createCompletionResponse(isChatCompletion, respTokens, toolCalls, &finishReason, usageData, modelName)
```

Collaborator:
The names doRemoteDecode and doRemotePrefill are strange. Shouldn't they be something like doPrefillOnly and doDecodeOnly, respectively?

Collaborator Author:
do-remote-decode and do-remote-prefill are field names of vLLM's request/response
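
For context, a sketch of how those vLLM request/response fields map onto Go JSON tags (the struct and its name are illustrative; only the JSON keys and their meaning come from this thread):

```go
// pdTransferFlags is an assumed name for this sketch.
type pdTransferFlags struct {
	DoRemoteDecode  bool `json:"do_remote_decode"`  // true means this is a prefill request
	DoRemotePrefill bool `json:"do_remote_prefill"` // true means this is a decode request
}
```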

Signed-off-by: Maya Barnea <mayab@il.ibm.com>
@mayabar merged commit 996dae3 into llm-d:main on Jul 15, 2025
2 checks passed